@@ -16,7 +16,7 @@ module Agents
     description <<-MD
       The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.

-      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all` or `on_change`.
+      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all`, `on_change`, or `merge` (if fetching based on an Event, see below).

      `url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)

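For reference, a full set of Agent options exercising the description above might look like this — a sketch assembled from the `extract` example shown later in this same description; `on_change` could equally be `all` or, when the Agent is fed by another Agent's Events, the new `merge`:

```json
{
  "url": "http://xkcd.com",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "url": { "css": "#comic img", "value": "@src" },
    "body_text": { "css": "div.main", "value": ".//text()" }
  }
}
```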
@@ -37,7 +37,7 @@ module Agents

       # Scraping HTML and XML

-      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string. Here's an example:
+      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into a string. Here's an example:

           "extract": {
             "url": { "css": "#comic img", "value": "@src" },
@@ -45,11 +45,11 @@ module Agents
             "body_text": { "css": "div.main", "value": ".//text()" }
           }

-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use ".".
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and `.//text()` extracts all the enclosed text. To extract the innerHTML, use `./node()`; and to extract the outer HTML, use `.`.

-      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
+      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove commas from formatted numbers, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.

-      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
+      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to `true`.

       # Scraping JSON

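The two-step extraction described in this hunk — select a node set, then evaluate the `value` XPath on each node and stringify the result — can be sketched standalone with Ruby's bundled REXML library (the Agent's actual parser differs; this only illustrates the `@_attr_` and `.//text()` semantics):

```ruby
require 'rexml/document'

xml = <<~XML
  <root>
    <div id="comic"><img src="/strips/42.png"/></div>
    <div class="main">Body <em>text</em> here</div>
  </root>
XML

doc = REXML::Document.new(xml)

# "@src" pulls the value of the src attribute from each selected node.
urls = REXML::XPath.match(doc, "//div[@id='comic']/img").map do |node|
  REXML::XPath.first(node, '@src').value
end

# ".//text()" gathers every text node enclosed by the selected node.
body_text = REXML::XPath.match(doc, "//div[@class='main']").map do |node|
  REXML::XPath.match(node, './/text()').map(&:value).join
end

puts urls.inspect       # => ["/strips/42.png"]
puts body_text.inspect  # => ["Body text here"]
```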
@@ -92,7 +92,7 @@ module Agents

       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.

-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Note that text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).

       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).

@@ -343,7 +343,7 @@ module Agents
           if url_template = options['url_from_event'].presence
             interpolate_options(url_template)
           else
-            event.payload['url']
+            event.payload['url'].presence || interpolated['url']
           end
         check_urls(url_to_scrape, existing_payload)
       end
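The new `else` branch makes the Agent prefer the incoming Event's `url`, but fall back to its own `url` option when the payload value is blank instead of trying to scrape `nil`. A standalone sketch of that behavior — `presence` here is a simplified stand-in for ActiveSupport's `Object#presence`, and `url_to_scrape` is a hypothetical helper, not a method from the patch:

```ruby
# Simplified stand-in for ActiveSupport's Object#presence:
# nil for nil or blank strings, the value itself otherwise.
def presence(value)
  return nil if value.nil?
  return nil if value.is_a?(String) && value.strip.empty?
  value
end

# Mirrors the patched branch: the Event payload's `url` wins,
# but a missing or blank value falls back to the Agent's own option.
def url_to_scrape(event_payload, interpolated)
  presence(event_payload['url']) || interpolated['url']
end

url_to_scrape({ 'url' => 'http://xkcd.com/614' }, { 'url' => 'http://xkcd.com' })
# => "http://xkcd.com/614"
url_to_scrape({}, { 'url' => 'http://xkcd.com' })
# => "http://xkcd.com"  (before this patch, the branch returned nil here)
```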